The data relates to a phone-call marketing campaign run by a banking institution; the goal is to predict whether or not a client will subscribe to a term deposit. Term deposits are considered a more secure investment than stocks because they are somewhat protected from market fluctuations. Generally, a client invests a specific sum for a set amount of time (e.g. 5 months) at a predetermined interest rate. The investment is then withdrawn once the term has passed, or earlier, typically with a cost penalty.
The dataset contains every contact attempt made to each client; a client may be contacted multiple times during the campaign to determine whether or not they will subscribe to a term deposit. In total, there are 41,188 observations. For the social and economic context attributes, keep in mind that the indicators are assumed to be drawn from the general population rather than from the individual client, which has a normalizing effect on the data.
https://archive.ics.uci.edu/ml/datasets/bank%20marketing
Input variables:
1 - age (numeric)
2 - job: type of job (categorical: ‘admin.’,‘blue-collar’,‘entrepreneur’,‘housemaid’,‘management’,‘retired’,‘self-employed’,‘services’,‘student’,‘technician’,‘unemployed’,‘unknown’)
3 - marital: marital status (categorical: ‘divorced’,‘married’,‘single’,‘unknown’; note: ‘divorced’ means divorced or widowed)
4 - education (categorical: ‘basic.4y’,‘basic.6y’,‘basic.9y’,‘high.school’,‘illiterate’,‘professional.course’,‘university.degree’,‘unknown’)
5 - default: has credit in default? (categorical: ‘no’,‘yes’,‘unknown’)
6 - housing: has housing loan? (categorical: ‘no’,‘yes’,‘unknown’)
7 - loan: has personal loan? (categorical: ‘no’,‘yes’,‘unknown’)
12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
14 - previous: number of contacts performed before this campaign and for this client (numeric)
15 - poutcome: outcome of the previous marketing campaign (categorical: ‘failure’,‘nonexistent’,‘success’)
21 - y: has the client subscribed to a term deposit? (binary: ‘yes’,‘no’)
$ age : int
$ job : Factor
$ marital : Factor
$ education : Factor
$ default : Factor
$ housing : Factor
$ loan : Factor
$ contact : Factor
$ month : Factor
$ day_of_week : Factor
$ duration : int
$ campaign : int
$ pdays : int
$ previous : int
$ poutcome : Factor
$ emp.var.rate : Factor
$ cons.price.idx: Factor
$ cons.conf.idx : Factor
$ euribor3m : Factor
$ nr.employed : Factor
$ y : Factor
Convert the required columns (the economic indicators that were read in as factors) from factor to numeric.
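When a numeric column has been read in as a factor, calling `as.numeric()` on it directly returns the internal level codes rather than the recorded values. The summary further down reports `emp.var.rate` running from 1 to 10 and `euribor3m` from 1 to 315, which is the pattern level codes would produce, so the conversion is worth double-checking. A minimal sketch of the difference, using a made-up factor `f` (hypothetical values, not taken from the dataset):

```r
# Hypothetical factor whose labels are numeric strings.
f <- factor(c("1.1", "-0.1", "1.4", "-0.1"))

codes  <- as.numeric(f)                # internal level codes, not the values
values <- as.numeric(as.character(f))  # the numbers the labels represent

print(codes)
print(values)
```

Going through `as.character()` first keeps the original scale, so coefficients on these columns remain interpretable in the indicator's own units.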
# EDA - Determining significant columns

1. Pdays - most observations have no prior contact and therefore no count of days since the last contact (pdays equal to 999). Only 3.68% of the observations (1,515 records) have a recorded prior contact (pdays not equal to 999). This appears to be an insignificant column.
2. Previous - while ~14% of the clients were contacted prior to the current campaign, 11% were contacted only once before, leaving only 3% that were contacted more than once.
3. Campaigns - potentially significant.
4. Duration - this should be removed, considering that the dataset documentation states that this attribute should not be included in a realistic predictive model.
There are no NA values, because missing values are coded as the category ‘unknown’.
Columns with missing data:
- Job: 330 (type of job)
- Marital: 80
- Education: 1,731 (highest education received)
- Default: 8,597 (whether or not they have credit in default, i.e. failure to pay)
- Housing: 990 (has a housing loan or not)
- Loan: 990 (personal loan or not)
Total number of observations: 41,188
Total number of observations with at least one missing value: 10,700
~26% of the observations have at least one missing value.
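Counts like these can be reproduced by treating ‘unknown’ as the missing-value marker. The sketch below uses a tiny invented data frame named `bank` (a hypothetical stand-in, not the real data) and base R only:

```r
# Toy stand-in for the categorical columns; all values are invented.
bank <- data.frame(
  job     = c("admin.", "unknown", "services", "technician"),
  marital = c("married", "single", "unknown", "married"),
  loan    = c("no", "unknown", "yes", "no"),
  stringsAsFactors = TRUE
)

# Number of "unknown" entries in each column.
unknown_per_col <- sapply(bank, function(col) sum(col == "unknown"))

# Rows with at least one "unknown" entry, and their share of all rows.
has_unknown        <- Reduce(`|`, lapply(bank, function(col) col == "unknown"))
rows_with_unknown  <- sum(has_unknown)
share_with_unknown <- rows_with_unknown / nrow(bank)

print(unknown_per_col)
print(rows_with_unknown)
print(share_with_unknown)
```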
The summary statistics for the rows with missing data do not initially appear to be significantly skewed relative to the full data. One concern is that all columns with missing data points are categorical. The majority of these rows have only one missing value, with less than 20% missing more than one field.
Options:
- Ignore observations - not ideal
- Ignore variable - TBD in analysis
- Develop model to predict missing values
- Treat missing data as just another category - Recommended
FULL DATASET: Visual for numeric variables, colored by whether or not the client subscribed to a term deposit.
MISSING DATA: Visual for numeric variables, colored by whether or not the client subscribed to a term deposit.
Visual for categorical variables, colored by whether or not the client subscribed to a term deposit.
There are far more No’s than Yes’s in the response variable. To balance the dataset, the training split is downsampled so that both classes are equally represented; sampling after the split prevents the minority group from being entirely excluded from the test set. The downsampled training set is then removed from the full dataset, and the remaining observations form the test set.
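The balancing step can be sketched roughly as follows. The data frame `bank`, the seed, and the 50% share of ‘yes’ rows assigned to training are all assumptions made for illustration; the report does not state the share actually used.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Simulated stand-in: 1000 rows, roughly 11% "yes" (invented data).
bank <- data.frame(
  ID = 1:1000,
  y  = factor(ifelse(runif(1000) < 0.11, "yes", "no"), levels = c("no", "yes"))
)

yes_ids <- bank$ID[bank$y == "yes"]
no_ids  <- bank$ID[bank$y == "no"]

# Assumed: half of the minority rows go to training.
n_train_yes <- floor(0.5 * length(yes_ids))
train_yes   <- sample(yes_ids, n_train_yes)
train_no    <- sample(no_ids,  n_train_yes)   # downsample the majority to match

train <- bank[bank$ID %in% c(train_yes, train_no), ]
test  <- bank[!bank$ID %in% train$ID, ]       # everything not used for training

print(table(train$y))
print(table(test$y))
```

Keying the split on an `ID` column makes the removal of training rows from the full data explicit, and the test set keeps the original class imbalance.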
The primary focus of objective 1 is to preserve interpretability while still producing an accurate model that predicts efficiently.
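The logistic regression fits and AIC-based selection reported in the output that follows can be sketched with base R's `glm()` and `step()`. The data here are simulated and the variable names `x1`, `x2`, `x3` are hypothetical; whether the report used `step()` or another selection routine is not stated, so treat this as one possible approach.

```r
set.seed(42)  # arbitrary seed for the simulated data

# Simulated stand-in: x1 drives the response, x2 and x3 are noise.
n <- 500
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- factor(ifelse(runif(n) < plogis(-0.5 + 1.2 * d$x1), "yes", "no"),
              levels = c("no", "yes"))

# Full logistic regression on every predictor.
full_fit <- glm(y ~ ., family = "binomial", data = d)

# Stepwise selection by AIC, starting from the full model.
step_fit <- step(full_fit, direction = "both", trace = 0)

print(formula(step_fit))
print(c(full = AIC(full_fit), stepwise = AIC(step_fit)))
```

Because `step()` only accepts a move when it lowers the AIC, the selected model's AIC is never higher than that of the starting model.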
## [1] "age" "job" "marital" "education"
## [5] "default" "housing" "loan" "contact"
## [9] "month" "day_of_week" "duration" "campaign"
## [13] "pdays" "previous" "poutcome" "emp.var.rate"
## [17] "cons.price.idx" "cons.conf.idx" "euribor3m" "nr.employed"
## [21] "y" "pdays_0" "ID"
## age job marital education
## Min. :19.00 admin. :621 divorced: 256 university.degree :756
## 1st Qu.:32.00 blue-collar:452 married :1407 high.school :511
## Median :38.00 technician :378 single : 696 professional.course:326
## Mean :40.52 services :190 unknown : 5 basic.9y :302
## 3rd Qu.:48.00 management :168 basic.4y :230
## Max. :89.00 retired :141 unknown :124
## (Other) :414 (Other) :115
## default housing loan contact month
## no :1996 no :1078 no :1980 cellular :1709 may :639
## unknown: 368 unknown: 54 unknown: 54 telephone: 655 jul :392
## yes : 0 yes :1232 yes : 330 aug :334
## jun :294
## nov :235
## apr :200
## (Other):270
## day_of_week duration campaign previous
## fri:420 Min. : 5.0 Min. : 1.000 Min. :0.0000
## mon:447 1st Qu.: 146.8 1st Qu.: 1.000 1st Qu.:0.0000
## thu:508 Median : 265.0 Median : 2.000 Median :0.0000
## tue:480 Mean : 382.5 Mean : 2.267 Mean :0.3249
## wed:509 3rd Qu.: 511.0 3rd Qu.: 3.000 3rd Qu.:0.0000
## Max. :4199.0 Max. :23.000 Max. :6.0000
##
## poutcome emp.var.rate.V1 cons.price.idx.V1 cons.conf.idx.V1
## failure : 284 Min. : 1.000000 Min. : 1.000000 Min. : 1.000000
## nonexistent:1836 1st Qu.: 5.000000 1st Qu.: 9.000000 1st Qu.:10.000000
## success : 244 Median : 6.000000 Median :14.000000 Median :18.000000
## Mean : 6.861675 Mean :14.280457 Mean :15.228849
## 3rd Qu.:10.000000 3rd Qu.:19.000000 3rd Qu.:20.000000
## Max. :10.000000 Max. :26.000000 Max. :26.000000
##
## euribor3m.V1 nr.employed.V1 y pdays_0
## Min. : 1.00000 Min. : 1.000000 no :1182 Min. : 0.0000
## 1st Qu.:198.00000 1st Qu.: 6.000000 yes:1182 1st Qu.: 0.0000
## Median :268.00000 Median : 9.000000 Median : 0.0000
## Mean :226.07953 Mean : 7.759729 Mean : 0.6882
## 3rd Qu.:304.00000 3rd Qu.:11.000000 3rd Qu.: 0.0000
## Max. :315.00000 Max. :11.000000 Max. :27.0000
##
## y
## housing no yes
## no 16596 2026
## unknown 883 107
## yes 19069 2507
##
## Call:
## glm(formula = y ~ ., family = "binomial", data = trainingsData2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.7857 -0.3569 -0.0356 0.4356 2.5966
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.015e+00 8.093e-01 2.490 0.01278 *
## age 1.298e-03 8.261e-03 0.157 0.87517
## jobblue-collar -4.388e-01 2.601e-01 -1.687 0.09157 .
## jobentrepreneur -1.439e-01 3.935e-01 -0.366 0.71452
## jobhousemaid -3.949e-02 4.637e-01 -0.085 0.93212
## jobmanagement -5.238e-02 2.918e-01 -0.179 0.85756
## jobretired 6.214e-01 3.784e-01 1.642 0.10056
## jobself-employed -5.731e-01 3.849e-01 -1.489 0.13652
## jobservices 8.818e-03 2.920e-01 0.030 0.97591
## jobstudent 3.180e-01 3.754e-01 0.847 0.39690
## jobtechnician 1.197e-01 2.515e-01 0.476 0.63401
## jobunemployed 3.631e-01 4.195e-01 0.866 0.38675
## jobunknown 7.299e-01 7.593e-01 0.961 0.33642
## maritalmarried 7.402e-02 2.238e-01 0.331 0.74083
## maritalsingle 2.718e-01 2.601e-01 1.045 0.29602
## maritalunknown -6.659e-01 1.153e+00 -0.578 0.56344
## educationbasic.6y 1.946e-01 4.114e-01 0.473 0.63613
## educationbasic.9y -4.607e-02 3.100e-01 -0.149 0.88184
## educationhigh.school -7.125e-02 3.132e-01 -0.228 0.82003
## educationilliterate 7.542e+00 3.247e+02 0.023 0.98147
## educationprofessional.course 1.135e-01 3.395e-01 0.334 0.73806
## educationuniversity.degree 4.729e-01 3.093e-01 1.529 0.12631
## educationunknown 2.401e-01 4.001e-01 0.600 0.54850
## defaultunknown -1.502e-01 2.193e-01 -0.685 0.49343
## housingunknown -1.346e-01 4.543e-01 -0.296 0.76708
## housingyes 1.648e-01 1.400e-01 1.177 0.23911
## loanunknown NA NA NA NA
## loanyes -1.709e-01 1.957e-01 -0.873 0.38257
## contacttelephone -1.757e-01 2.457e-01 -0.715 0.47458
## monthaug -6.475e-01 3.847e-01 -1.683 0.09239 .
## monthdec 8.580e-01 1.121e+00 0.765 0.44411
## monthjul -4.859e-01 3.075e-01 -1.580 0.11406
## monthjun 1.156e-01 3.060e-01 0.378 0.70570
## monthmar 4.244e-01 4.117e-01 1.031 0.30266
## monthmay -1.808e+00 2.533e-01 -7.138 9.46e-13 ***
## monthnov -1.112e+00 3.666e-01 -3.033 0.00242 **
## monthoct -2.604e-01 4.301e-01 -0.605 0.54496
## monthsep 1.943e-01 7.115e-01 0.273 0.78480
## day_of_weekmon -3.061e-01 2.270e-01 -1.349 0.17742
## day_of_weekthu -2.650e-01 2.244e-01 -1.181 0.23770
## day_of_weektue -3.962e-01 2.234e-01 -1.774 0.07609 .
## day_of_weekwed -1.418e-01 2.232e-01 -0.635 0.52535
## duration 7.914e-03 3.860e-04 20.504 < 2e-16 ***
## campaign -4.655e-02 4.054e-02 -1.148 0.25083
## previous 3.666e-02 2.278e-01 0.161 0.87215
## poutcomenonexistent 4.321e-01 3.327e-01 1.299 0.19403
## poutcomesuccess 1.861e+00 3.856e-01 4.826 1.40e-06 ***
## emp.var.rate -3.936e-02 4.868e-02 -0.809 0.41875
## cons.price.idx -9.007e-02 1.760e-02 -5.118 3.08e-07 ***
## cons.conf.idx 8.587e-03 2.214e-02 0.388 0.69813
## euribor3m 9.086e-04 2.672e-03 0.340 0.73388
## nr.employed -4.321e-01 7.497e-02 -5.764 8.23e-09 ***
## pdays_0 -2.760e-02 4.743e-02 -0.582 0.56060
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3277.2 on 2363 degrees of freedom
## Residual deviance: 1450.9 on 2312 degrees of freedom
## AIC: 1554.9
##
## Number of Fisher Scoring iterations: 11
##
## Call:
## glm(formula = y ~ job + education + contact + month + day_of_week +
## campaign + poutcome + cons.price.idx + nr.employed, family = "binomial",
## data = trainingsData2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7799 -0.8487 -0.2234 0.8065 2.0740
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.47260 0.36306 6.810 9.73e-12 ***
## jobblue-collar -0.03023 0.18442 -0.164 0.869797
## jobentrepreneur 0.32460 0.26398 1.230 0.218830
## jobhousemaid -0.01002 0.33984 -0.029 0.976476
## jobmanagement 0.19740 0.21032 0.939 0.347954
## jobretired 0.65718 0.27539 2.386 0.017017 *
## jobself-employed 0.02237 0.27994 0.080 0.936304
## jobservices 0.14278 0.20694 0.690 0.490240
## jobstudent 0.37249 0.28968 1.286 0.198491
## jobtechnician 0.15462 0.18105 0.854 0.393097
## jobunemployed 0.19178 0.32834 0.584 0.559153
## jobunknown 0.45448 0.56062 0.811 0.417550
## educationbasic.6y 0.27485 0.28290 0.972 0.331269
## educationbasic.9y 0.26757 0.22155 1.208 0.227152
## educationhigh.school 0.29699 0.22366 1.328 0.184237
## educationilliterate 11.77535 324.74389 0.036 0.971075
## educationprofessional.course 0.10327 0.24897 0.415 0.678297
## educationuniversity.degree 0.55574 0.22235 2.499 0.012440 *
## educationunknown 0.30765 0.29471 1.044 0.296526
## contacttelephone -0.56018 0.15255 -3.672 0.000241 ***
## monthaug -0.56257 0.22148 -2.540 0.011084 *
## monthdec 1.40186 1.06442 1.317 0.187835
## monthjul -0.13876 0.22235 -0.624 0.532586
## monthjun 0.28091 0.24770 1.134 0.256766
## monthmar 0.39582 0.37897 1.044 0.296266
## monthmay -0.96424 0.19709 -4.892 9.96e-07 ***
## monthnov -0.64135 0.22844 -2.808 0.004992 **
## monthoct -0.03536 0.35156 -0.101 0.919881
## monthsep 0.75049 0.63559 1.181 0.237689
## day_of_weekmon -0.33495 0.16360 -2.047 0.040627 *
## day_of_weekthu -0.09909 0.15932 -0.622 0.533975
## day_of_weektue -0.22911 0.16024 -1.430 0.152768
## day_of_weekwed -0.12754 0.15840 -0.805 0.420729
## campaign -0.04732 0.02489 -1.901 0.057276 .
## poutcomenonexistent 0.49247 0.16084 3.062 0.002199 **
## poutcomesuccess 1.96530 0.31788 6.183 6.31e-10 ***
## cons.price.idx -0.03878 0.01195 -3.246 0.001172 **
## nr.employed -0.26161 0.02568 -10.187 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3277.2 on 2363 degrees of freedom
## Residual deviance: 2483.1 on 2326 degrees of freedom
## AIC: 2559.1
##
## Number of Fisher Scoring iterations: 11
##
## Call:
## glm(formula = y ~ education + month + poutcome + emp.var.rate +
## cons.price.idx + nr.employed, family = "binomial", data = trainingsData2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7280 -0.8653 -0.2898 0.7726 1.9281
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.93862 0.29190 10.067 < 2e-16 ***
## educationbasic.6y 0.19722 0.27239 0.724 0.46904
## educationbasic.9y 0.17020 0.20988 0.811 0.41740
## educationhigh.school 0.24921 0.19243 1.295 0.19528
## educationilliterate 11.94159 324.74377 0.037 0.97067
## educationprofessional.course 0.10790 0.21235 0.508 0.61138
## educationuniversity.degree 0.49853 0.18518 2.692 0.00710 **
## educationunknown 0.30811 0.27294 1.129 0.25896
## monthaug -0.39956 0.23522 -1.699 0.08938 .
## monthdec 1.31327 1.05090 1.250 0.21142
## monthjul 0.10156 0.23137 0.439 0.66070
## monthjun 0.16464 0.24616 0.669 0.50361
## monthmar 0.28862 0.37423 0.771 0.44057
## monthmay -1.12716 0.19059 -5.914 3.34e-09 ***
## monthnov -0.86736 0.24104 -3.598 0.00032 ***
## monthoct -0.09388 0.33879 -0.277 0.78170
## monthsep 0.69254 0.63198 1.096 0.27316
## poutcomenonexistent 0.43500 0.15821 2.750 0.00597 **
## poutcomesuccess 1.94412 0.31647 6.143 8.09e-10 ***
## emp.var.rate -0.05198 0.02922 -1.779 0.07526 .
## cons.price.idx -0.05506 0.01097 -5.017 5.25e-07 ***
## nr.employed -0.26909 0.02694 -9.988 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3277.2 on 2363 degrees of freedom
## Residual deviance: 2511.9 on 2342 degrees of freedom
## AIC: 2555.9
##
## Number of Fisher Scoring iterations: 11
##
## Call:
## glm(formula = y ~ age + job + month + day_of_week + cons.price.idx +
## nr.employed + poutcome, family = "binomial", data = trainingsData2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7229 -0.8409 -0.2156 0.7854 1.9776
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.621869 0.348873 10.382 < 2e-16 ***
## age -0.009759 0.005424 -1.799 0.07199 .
## jobblue-collar -0.250931 0.147475 -1.702 0.08885 .
## jobentrepreneur 0.314275 0.258223 1.217 0.22358
## jobhousemaid -0.280920 0.328706 -0.855 0.39276
## jobmanagement 0.260030 0.207874 1.251 0.21097
## jobretired 0.542505 0.289440 1.874 0.06088 .
## jobself-employed -0.011980 0.274010 -0.044 0.96513
## jobservices -0.022122 0.192990 -0.115 0.90874
## jobstudent 0.106698 0.294131 0.363 0.71679
## jobtechnician -0.048868 0.156202 -0.313 0.75439
## jobunemployed 0.041558 0.320659 0.130 0.89688
## jobunknown 0.352375 0.540254 0.652 0.51425
## monthaug -0.514533 0.221117 -2.327 0.01997 *
## monthdec 1.212915 1.053470 1.151 0.24959
## monthjul -0.054247 0.218595 -0.248 0.80401
## monthjun 0.086295 0.239963 0.360 0.71913
## monthmar 0.278650 0.375858 0.741 0.45847
## monthmay -1.177753 0.190793 -6.173 6.70e-10 ***
## monthnov -0.648079 0.228295 -2.839 0.00453 **
## monthoct -0.204376 0.346627 -0.590 0.55545
## monthsep 0.628358 0.633112 0.992 0.32096
## day_of_weekmon -0.318913 0.162564 -1.962 0.04979 *
## day_of_weekthu -0.092060 0.157822 -0.583 0.55968
## day_of_weektue -0.224123 0.159016 -1.409 0.15871
## day_of_weekwed -0.127357 0.156469 -0.814 0.41568
## cons.price.idx -0.060041 0.010648 -5.639 1.71e-08 ***
## nr.employed -0.283311 0.025169 -11.256 < 2e-16 ***
## poutcomenonexistent 0.452561 0.160639 2.817 0.00484 **
## poutcomesuccess 1.995275 0.318730 6.260 3.85e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3277.2 on 2363 degrees of freedom
## Residual deviance: 2508.2 on 2334 degrees of freedom
## AIC: 2568.2
##
## Number of Fisher Scoring iterations: 5
##
## Call:
## glm(formula = y ~ education + month + poutcome + emp.var.rate +
## cons.price.idx + nr.employed, family = "binomial", data = trainingsData2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7280 -0.8653 -0.2898 0.7726 1.9281
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.93862 0.29190 10.067 < 2e-16 ***
## educationbasic.6y 0.19722 0.27239 0.724 0.46904
## educationbasic.9y 0.17020 0.20988 0.811 0.41740
## educationhigh.school 0.24921 0.19243 1.295 0.19528
## educationilliterate 11.94159 324.74377 0.037 0.97067
## educationprofessional.course 0.10790 0.21235 0.508 0.61138
## educationuniversity.degree 0.49853 0.18518 2.692 0.00710 **
## educationunknown 0.30811 0.27294 1.129 0.25896
## monthaug -0.39956 0.23522 -1.699 0.08938 .
## monthdec 1.31327 1.05090 1.250 0.21142
## monthjul 0.10156 0.23137 0.439 0.66070
## monthjun 0.16464 0.24616 0.669 0.50361
## monthmar 0.28862 0.37423 0.771 0.44057
## monthmay -1.12716 0.19059 -5.914 3.34e-09 ***
## monthnov -0.86736 0.24104 -3.598 0.00032 ***
## monthoct -0.09388 0.33879 -0.277 0.78170
## monthsep 0.69254 0.63198 1.096 0.27316
## poutcomenonexistent 0.43500 0.15821 2.750 0.00597 **
## poutcomesuccess 1.94412 0.31647 6.143 8.09e-10 ***
## emp.var.rate -0.05198 0.02922 -1.779 0.07526 .
## cons.price.idx -0.05506 0.01097 -5.017 5.25e-07 ***
## nr.employed -0.26909 0.02694 -9.988 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3277.2 on 2363 degrees of freedom
## Residual deviance: 2511.9 on 2342 degrees of freedom
## AIC: 2555.9
##
## Number of Fisher Scoring iterations: 11
## [1] "age" "job" "marital" "education"
## [5] "default" "housing" "loan" "contact"
## [9] "month" "day_of_week" "campaign" "previous"
## [13] "poutcome" "emp.var.rate" "cons.price.idx" "cons.conf.idx"
## [17] "euribor3m" "nr.employed" "y" "pdays_0"
## [1] "age" "job" "marital" "education"
## [5] "default" "housing" "loan" "contact"
## [9] "month" "day_of_week" "duration" "campaign"
## [13] "previous" "poutcome" "emp.var.rate" "cons.price.idx"
## [17] "cons.conf.idx" "euribor3m" "nr.employed" "y"
## [21] "pdays_0"
## 54 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 2.260117386
## (Intercept) .
## age .
## jobblue-collar .
## jobentrepreneur .
## jobhousemaid .
## jobmanagement .
## jobretired .
## jobself-employed .
## jobservices .
## jobstudent .
## jobtechnician .
## jobunemployed .
## jobunknown .
## maritalmarried .
## maritalsingle .
## maritalunknown .
## educationbasic.6y .
## educationbasic.9y .
## educationhigh.school .
## educationilliterate .
## educationprofessional.course .
## educationuniversity.degree .
## educationunknown .
## defaultunknown .
## defaultyes .
## housingunknown .
## housingyes .
## loanunknown .
## loanyes .
## contacttelephone -0.299815182
## monthaug .
## monthdec .
## monthjul .
## monthjun .
## monthmar .
## monthmay -0.428366909
## monthnov .
## monthoct .
## monthsep .
## day_of_weekmon .
## day_of_weekthu .
## day_of_weektue .
## day_of_weekwed .
## campaign .
## previous .
## poutcomenonexistent .
## poutcomesuccess 0.460833002
## emp.var.rate .
## cons.price.idx .
## cons.conf.idx .
## euribor3m -0.002943405
## nr.employed -0.179885021
## pdays_0 .
## [1] "CV Error Rate:"
## [1] 0.2563452
## [1] "Penalty Value:"
## [1] 0.03797573
## 54 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 2.260283194
## (Intercept) .
## age .
## jobblue-collar .
## jobentrepreneur .
## jobhousemaid .
## jobmanagement .
## jobretired .
## jobself-employed .
## jobservices .
## jobstudent .
## jobtechnician .
## jobunemployed .
## jobunknown .
## maritalmarried .
## maritalsingle .
## maritalunknown .
## educationbasic.6y .
## educationbasic.9y .
## educationhigh.school .
## educationilliterate .
## educationprofessional.course .
## educationuniversity.degree .
## educationunknown .
## defaultunknown .
## defaultyes .
## housingunknown .
## housingyes .
## loanunknown .
## loanyes .
## contacttelephone -0.299767682
## monthaug .
## monthdec .
## monthjul .
## monthjun .
## monthmar .
## monthmay -0.428231546
## monthnov .
## monthoct .
## monthsep .
## day_of_weekmon .
## day_of_weekthu .
## day_of_weektue .
## day_of_weekwed .
## campaign .
## previous .
## poutcomenonexistent .
## poutcomesuccess 0.460759481
## emp.var.rate .
## cons.price.idx .
## cons.conf.idx .
## euribor3m -0.002952097
## nr.employed -0.179658867
## pdays_0 .
##
## Call:
## glm(formula = y ~ job + marital + education + default + contact +
## month + day_of_week + campaign + poutcome + emp.var.rate +
## cons.conf.idx + cons.price.idx + nr.employed, family = "binomial",
## data = trainingsData2L)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7722 -0.8487 -0.2170 0.8041 2.0854
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.449e+00 4.957e-01 4.939 7.84e-07 ***
## jobblue-collar -2.410e-02 1.852e-01 -0.130 0.89644
## jobentrepreneur 3.337e-01 2.667e-01 1.251 0.21093
## jobhousemaid -7.971e-03 3.396e-01 -0.023 0.98127
## jobmanagement 1.913e-01 2.120e-01 0.902 0.36687
## jobretired 6.528e-01 2.774e-01 2.353 0.01862 *
## jobself-employed -3.082e-03 2.808e-01 -0.011 0.99124
## jobservices 1.459e-01 2.075e-01 0.703 0.48202
## jobstudent 3.690e-01 2.964e-01 1.245 0.21322
## jobtechnician 1.504e-01 1.813e-01 0.830 0.40672
## jobunemployed 1.974e-01 3.285e-01 0.601 0.54781
## jobunknown 1.983e-01 6.006e-01 0.330 0.74126
## maritalmarried 8.689e-02 1.642e-01 0.529 0.59673
## maritalsingle 7.111e-02 1.799e-01 0.395 0.69270
## maritalunknown 1.247e+00 1.007e+00 1.238 0.21567
## educationbasic.6y 2.549e-01 2.842e-01 0.897 0.36983
## educationbasic.9y 2.471e-01 2.237e-01 1.105 0.26920
## educationhigh.school 2.783e-01 2.271e-01 1.226 0.22036
## educationilliterate 1.173e+01 3.247e+02 0.036 0.97119
## educationprofessional.course 8.871e-02 2.511e-01 0.353 0.72387
## educationuniversity.degree 5.344e-01 2.263e-01 2.361 0.01821 *
## educationunknown 2.903e-01 2.965e-01 0.979 0.32750
## defaultunknown -1.082e-01 1.443e-01 -0.750 0.45350
## contacttelephone -5.280e-01 1.792e-01 -2.947 0.00321 **
## monthaug -4.972e-01 2.977e-01 -1.670 0.09493 .
## monthdec 1.439e+00 1.072e+00 1.343 0.17929
## monthjul -7.542e-02 2.413e-01 -0.313 0.75461
## monthjun 3.146e-01 2.530e-01 1.244 0.21365
## monthmar 4.018e-01 3.788e-01 1.061 0.28874
## monthmay -9.460e-01 2.008e-01 -4.710 2.47e-06 ***
## monthnov -7.021e-01 3.052e-01 -2.300 0.02145 *
## monthoct -9.108e-03 3.721e-01 -0.024 0.98047
## monthsep 7.869e-01 6.527e-01 1.206 0.22798
## day_of_weekmon -3.316e-01 1.639e-01 -2.023 0.04306 *
## day_of_weekthu -1.047e-01 1.600e-01 -0.654 0.51293
## day_of_weektue -2.331e-01 1.606e-01 -1.451 0.14665
## day_of_weekwed -1.413e-01 1.593e-01 -0.887 0.37509
## campaign -4.627e-02 2.488e-02 -1.860 0.06293 .
## poutcomenonexistent 4.916e-01 1.606e-01 3.060 0.00221 **
## poutcomesuccess 1.961e+00 3.176e-01 6.174 6.66e-10 ***
## emp.var.rate -1.889e-02 3.926e-02 -0.481 0.63036
## cons.conf.idx 3.209e-04 1.574e-02 0.020 0.98374
## cons.price.idx -3.765e-02 1.285e-02 -2.931 0.00338 **
## nr.employed -2.538e-01 3.327e-02 -7.630 2.34e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3277.2 on 2363 degrees of freedom
## Residual deviance: 2480.4 on 2320 degrees of freedom
## AIC: 2568.4
##
## Number of Fisher Scoring iterations: 11
##
## Call:
## glm(formula = y ~ ., family = "binomial", data = trainingsData2L)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8537 -0.8488 -0.2177 0.8089 2.1028
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.812371 0.634197 4.435 9.23e-06 ***
## age -0.007469 0.006196 -1.206 0.228009
## jobblue-collar -0.049024 0.186391 -0.263 0.792538
## jobentrepreneur 0.319826 0.267001 1.198 0.230977
## jobhousemaid 0.025708 0.338817 0.076 0.939517
## jobmanagement 0.199103 0.213673 0.932 0.351434
## jobretired 0.794598 0.303920 2.614 0.008936 **
## jobself-employed -0.004774 0.281532 -0.017 0.986471
## jobservices 0.123454 0.208364 0.592 0.553522
## jobstudent 0.282249 0.303763 0.929 0.352798
## jobtechnician 0.142048 0.181629 0.782 0.434170
## jobunemployed 0.189268 0.329348 0.575 0.565510
## jobunknown 0.229169 0.599743 0.382 0.702379
## maritalmarried 0.067850 0.165224 0.411 0.681327
## maritalsingle -0.008685 0.191101 -0.045 0.963752
## maritalunknown 1.199728 1.005500 1.193 0.232804
## educationbasic.6y 0.238947 0.285765 0.836 0.403062
## educationbasic.9y 0.224050 0.225524 0.993 0.320484
## educationhigh.school 0.233537 0.230729 1.012 0.311456
## educationilliterate 11.779753 324.743901 0.036 0.971064
## educationprofessional.course 0.055129 0.253175 0.218 0.827624
## educationuniversity.degree 0.496164 0.229259 2.164 0.030448 *
## educationunknown 0.289447 0.297665 0.972 0.330855
## defaultunknown -0.081148 0.145868 -0.556 0.577997
## housingunknown -0.192886 0.337801 -0.571 0.567996
## housingyes -0.047815 0.101232 -0.472 0.636689
## loanunknown NA NA NA NA
## loanyes -0.148812 0.143864 -1.034 0.300952
## contacttelephone -0.511631 0.180910 -2.828 0.004683 **
## monthaug -0.456237 0.306334 -1.489 0.136397
## monthdec 1.376588 1.079201 1.276 0.202111
## monthjul -0.115978 0.243464 -0.476 0.633815
## monthjun 0.249806 0.258254 0.967 0.333400
## monthmar 0.391764 0.380320 1.030 0.302968
## monthmay -0.947464 0.202235 -4.685 2.80e-06 ***
## monthnov -0.730936 0.310600 -2.353 0.018608 *
## monthoct 0.005452 0.373545 0.015 0.988355
## monthsep 0.846085 0.658780 1.284 0.199030
## day_of_weekmon -0.328607 0.164240 -2.001 0.045417 *
## day_of_weekthu -0.111421 0.160408 -0.695 0.487301
## day_of_weektue -0.233664 0.161139 -1.450 0.147037
## day_of_weekwed -0.140542 0.159695 -0.880 0.378822
## campaign -0.045726 0.024964 -1.832 0.066999 .
## previous -0.016070 0.204675 -0.079 0.937417
## poutcomenonexistent 0.503337 0.287042 1.754 0.079511 .
## poutcomesuccess 1.887159 0.362979 5.199 2.00e-07 ***
## emp.var.rate -0.021809 0.039639 -0.550 0.582190
## cons.price.idx -0.032389 0.014911 -2.172 0.029848 *
## cons.conf.idx 0.005024 0.016965 0.296 0.767134
## euribor3m -0.001953 0.002271 -0.860 0.389699
## nr.employed -0.207482 0.062826 -3.303 0.000958 ***
## pdays_0 0.022405 0.043187 0.519 0.603909
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3277.2 on 2363 degrees of freedom
## Residual deviance: 2476.3 on 2313 degrees of freedom
## AIC: 2578.3
##
## Number of Fisher Scoring iterations: 11
According to the AUC of the ROC curves, the stepwise model (0.933) has the upper hand, while the Lasso model has the lowest AUC (0.778).
## model AIC AUC
## 1 original 2559.079 0.784
## 2 stepwise 2555.858 0.933
## 3 forward 2568.152 0.782
## 4 backward 2555.858 0.782
## 5 lasso 2568.400 0.778
According to the AUC, stepwise has the highest area under the curve (0.933); the original, forward, backward, and Lasso models cluster between 0.778 and 0.784.
While the AIC values of the five models are relatively close, the stepwise model (tied with backward at 2555.858) has the lowest.
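For reference, the AUC can be computed without any add-on package through its rank-sum identity: it is the probability that a randomly chosen positive case scores higher than a randomly chosen negative one. The helper `auc_rank` and the scores below are hypothetical, written for illustration rather than taken from the report.

```r
# AUC via the rank-sum identity.
# `score` is a predicted probability; `actual` is TRUE for the positive class.
auc_rank <- function(score, actual) {
  r     <- rank(score)        # average ranks, so ties count as half
  n_pos <- sum(actual)
  n_neg <- sum(!actual)
  (sum(r[actual]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Invented scores and labels.
score  <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.2)
actual <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)

auc <- auc_rank(score, actual)
print(auc)
```

In this toy case 8 of the 9 positive/negative pairs are ordered correctly, so the AUC is 8/9.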
## (Intercept) educationbasic.6y
## 1.888976e+01 1.218015e+00
## educationbasic.9y educationhigh.school
## 1.185541e+00 1.283016e+00
## educationilliterate educationprofessional.course
## 1.535204e+05 1.113933e+00
## educationuniversity.degree educationunknown
## 1.646304e+00 1.360848e+00
## monthaug monthdec
## 6.706119e-01 3.718323e+00
## monthjul monthjun
## 1.106896e+00 1.178965e+00
## monthmar monthmay
## 1.334589e+00 3.239509e-01
## monthnov monthoct
## 4.200598e-01 9.103909e-01
## monthsep poutcomenonexistent
## 1.998782e+00 1.544962e+00
## poutcomesuccess emp.var.rate
## 6.987491e+00 9.493502e-01
## cons.price.idx nr.employed
## 9.464297e-01 7.640763e-01
## Waiting for profiling to be done...
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning in regularize.values(x, y, ties, missing(ties)): collapsing to unique
## 'x' values
## OR 2.5 % 97.5 %
## (Intercept) 1.888976e+01 1.070782e+01 33.6505670
## educationbasic.6y 1.218015e+00 7.116020e-01 2.0728255
## educationbasic.9y 1.185541e+00 7.864294e-01 1.7914535
## educationhigh.school 1.283016e+00 8.813852e-01 1.8749398
## educationilliterate 1.535204e+05 2.800916e-24 NA
## educationprofessional.course 1.113933e+00 7.351269e-01 1.6908780
## educationuniversity.degree 1.646304e+00 1.147383e+00 2.3724508
## educationunknown 1.360848e+00 7.968346e-01 2.3252260
## monthaug 6.706119e-01 4.218880e-01 1.0617838
## monthdec 3.718323e+00 7.149775e-01 68.4416690
## monthjul 1.106896e+00 7.017134e-01 1.7395109
## monthjun 1.178965e+00 7.271880e-01 1.9104463
## monthmar 1.334589e+00 6.601269e-01 2.8962979
## monthmay 3.239509e-01 2.216778e-01 0.4683999
## monthnov 4.200598e-01 2.618415e-01 0.6747519
## monthoct 9.103909e-01 4.756954e-01 1.8067027
## monthsep 1.998782e+00 6.674132e-01 8.6480245
## poutcomenonexistent 1.544962e+00 1.133581e+00 2.1086789
## poutcomesuccess 6.987491e+00 3.865071e+00 13.4717836
## emp.var.rate 9.493502e-01 8.970638e-01 1.0061900
## cons.price.idx 9.464297e-01 9.263164e-01 0.9671140
## nr.employed 7.640763e-01 7.240384e-01 0.8048206
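The odds ratios in this table are the exponentiated coefficients of the stepwise model. As a worked check, the sketch below recomputes the `poutcomesuccess` row from the estimate and standard error printed in the model summary. It uses a Wald interval, which is an assumption on my part: the "Waiting for profiling to be done..." message suggests the table's intervals are profile-likelihood based, so the bounds will differ slightly.

```r
# Estimate and standard error for poutcomesuccess, copied from the
# stepwise model summary in this report.
est <- 1.94412
se  <- 0.31647

odds_ratio <- exp(est)
wald_ci    <- exp(est + c(-1, 1) * qnorm(0.975) * se)  # Wald, not profile-likelihood

print(round(odds_ratio, 3))
print(round(wald_ci, 3))
```

Read this way, a successful outcome in the previous campaign multiplies the odds of the modeled response by roughly seven relative to the reference level of `poutcome`, holding the other predictors fixed.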
The cutoff was selected by manually iterating through candidate values and referencing the ROC curve to determine the best one.
##   Model  Accuracy Sensitivity Specificity   Average Cutoff
## 1  Step 0.8740985    0.877764   0.8366108 0.8628244    0.4
##      Model  Accuracy Sensitivity Specificity   Average Cutoff
## 1 Original 0.8477488   0.8723916   0.5957201 0.7719535    0.4
## 2     Step 0.8740985   0.8777640   0.8366108 0.8628244    0.4
## 3  Forward 0.8464094   0.8708929   0.5960093 0.7711039    0.4
## 4 Backward 0.8448382   0.8691964   0.5957201 0.7699182    0.4
## 5    Lasso 0.8842211   0.9229486   0.4881434 0.7651044    0.4
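The manual cutoff search described above can be sketched as a loop over candidate thresholds. In the tables above, the `Average` column equals the mean of accuracy, sensitivity, and specificity (for the Step model, (0.8740985 + 0.8777640 + 0.8366108) / 3 = 0.8628244), so the sketch uses the same definition. The helper `cutoff_metrics()` and the simulated `truth` and `prob` vectors are hypothetical and are not taken from the report's code.

```r
# Hypothetical helper: metrics for a single probability cutoff.
# `truth` is a 0/1 vector (1 = positive class); `prob` is P(positive).
cutoff_metrics <- function(truth, prob, cutoff) {
  pred <- as.integer(prob >= cutoff)
  tp <- sum(pred == 1 & truth == 1)
  tn <- sum(pred == 0 & truth == 0)
  fp <- sum(pred == 1 & truth == 0)
  fn <- sum(pred == 0 & truth == 1)
  acc  <- (tp + tn) / length(truth)
  sens <- tp / (tp + fn)
  spec <- tn / (tn + fp)
  data.frame(Cutoff = cutoff, Accuracy = acc, Sensitivity = sens,
             Specificity = spec, Average = (acc + sens + spec) / 3)
}

# Simulated labels and scores standing in for model predictions.
set.seed(42)
truth <- rbinom(1000, 1, 0.3)
prob  <- plogis(rnorm(1000, mean = ifelse(truth == 1, 1, -1)))

# Evaluate every candidate cutoff and keep the one with the highest Average.
cutoffs   <- seq(0.1, 0.9, by = 0.1)
sweep_tbl <- do.call(rbind, lapply(cutoffs, cutoff_metrics,
                                   truth = truth, prob = prob))
best <- sweep_tbl[which.max(sweep_tbl$Average), ]
print(sweep_tbl)
print(best)
```

Choosing by `Average` gives sensitivity and specificity more weight than accuracy alone would, which matters when one class dominates the data.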
The goal of model 2 is to improve predictive performance, accepting a loss of interpretability in exchange.
##              Model  Accuracy Sensitivity Specificity  Average Cutoff
## 1 evr*cpi + cpi*em 0.8525139   0.8771136   0.6009254 0.776851    0.6
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##                 poutcome                    month              nr.employed
##                 4.442001                 5.535462                32.676342
##             emp.var.rate           cons.price.idx                euribor3m
##                90.833566                 5.175381                 9.946142
##                 duration           month:duration nr.employed:emp.var.rate
##                21.213439                 6.659970               208.299506
##        poutcome:duration
##                 7.540782
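The values above look like variance inflation factors, and the largest ones belong to an interaction term and its components, which is expected because a product term is strongly correlated with the variables it is built from. A minimal sketch of the underlying calculation for numeric predictors follows. The helper `vif_manual()` and the simulated columns are hypothetical, and the sketch does not cover the factor terms that appear in the report's output.

```r
# Hypothetical helper: VIF_j = 1 / (1 - R^2_j), where R^2_j comes from
# regressing column j on all other columns of a numeric data frame.
vif_manual <- function(df) {
  sapply(names(df), function(v) {
    others <- setdiff(names(df), v)
    r2 <- summary(lm(reformulate(others, response = v), data = df))$r.squared
    1 / (1 - r2)
  })
}

set.seed(7)
n <- 400
a <- rnorm(n, mean = 50, sd = 2)  # means far from zero, like macro indicators
b <- rnorm(n, mean = 10, sd = 1)

# Raw product term: strongly collinear with its components.
X <- data.frame(a = a, b = b, ab = a * b)
print(vif_manual(X))

# Centering before forming the product removes most of that collinearity.
Xc <- data.frame(a = a - mean(a), b = b - mean(b))
Xc$ab <- Xc$a * Xc$b
print(vif_manual(Xc))
```

Centering is one common mitigation; whether it was applied in the report is not shown. High VIFs mainly affect the stability of individual coefficients, which matters less for a model built for prediction.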
## [1] 0.8693592
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
## -5.4878 -3.7967 -2.4126 -1.7493 -0.3549 42.0157
##                Model  Accuracy Sensitivity Specificity   Average Cutoff
## 1  Complex Log Model 0.8494746   0.8448510   0.8967611        NA   0.00
## 2  Complex Log Model 0.8527200   0.8490358   0.8903991        NA   0.05
## 3  Complex Log Model 0.8553472   0.8528813   0.8805668 0.8629318   0.10
## 4  Complex Log Model 0.8588760   0.8572923   0.8750723 0.8629318   0.15
## 5  Complex Log Model 0.8621986   0.8613923   0.8704453 0.8629318   0.20
## 6  Complex Log Model 0.8654183   0.8656337   0.8632157 0.8629318   0.25
## 7  Complex Log Model 0.8693592   0.8704123   0.8585888 0.8629318   0.30
## 8  Complex Log Model 0.8725015   0.8747102   0.8499132 0.8629318   0.35
## 9  Complex Log Model 0.8754379   0.8785840   0.8432620 0.8629318   0.40
## 10 Complex Log Model 0.8778075   0.8820336   0.8345865 0.8629318   0.45
## 11 Complex Log Model 0.8802287   0.8852853   0.8285136 0.8629318   0.50
## 12 Complex Log Model 0.8825211   0.8886501   0.8198381 0.8629318   0.55
## 13 Complex Log Model 0.8846332   0.8917039   0.8123193 0.8629318   0.60
## 14 Complex Log Model 0.8869771   0.8950687   0.8042221 0.8629318   0.65
## 15 Complex Log Model 0.8889347   0.8979811   0.7964141 0.8629318   0.70
## 16 Complex Log Model 0.8909180   0.9011197   0.7865818 0.8629318   0.75
## 17 Complex Log Model 0.8931589   0.9044845   0.7773279 0.8629318   0.80
## 18 Complex Log Model 0.8949619   0.9074818   0.7669173 0.8629318   0.85
## 19 Complex Log Model 0.8965331   0.9106486   0.7521689 0.8629318   0.90
## 20 Complex Log Model 0.8981558   0.9132500   0.7437825 0.8629318   0.95
## 21 Complex Log Model 0.8992633   0.9155686   0.7325043 0.8629318   1.00
The plot of PCA2 vs. PCA4 shows a clear separation, whereas the earlier component pairs do not separate nearly as clearly.
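A comparison of component pairs like the one above can be produced with `prcomp()` and a scatterplot matrix. The following is only a sketch on simulated data: the columns `v1` to `v5`, the class vector `cls`, and the built-in class shift are assumptions, not the report's variables.

```r
# Simulated numeric predictors with a class-dependent shift in two columns.
set.seed(3)
n     <- 300
cls   <- factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.8, 0.2)))
shift <- ifelse(cls == "yes", 1.5, 0)
num <- data.frame(
  v1 = rnorm(n),
  v2 = rnorm(n) + shift,
  v3 = rnorm(n),
  v4 = rnorm(n) - shift,
  v5 = rnorm(n)
)

# Center and scale so that no variable dominates because of its units.
pca    <- prcomp(num, center = TRUE, scale. = TRUE)
scores <- as.data.frame(pca$x)

# Variance explained by the first four components.
print(summary(pca)$importance[, 1:4])

# Scatterplot matrix of the first four components, colored by class.
pairs(scores[, 1:4], col = ifelse(cls == "yes", "red", "grey40"), pch = 20)
```

Components are ordered by variance explained, not by how well they separate the classes, so a later pair can discriminate better than an earlier one.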
Create another competing model using only the continuous predictors, fitted with LDA or QDA.
## Confusion Matrix and Statistics
##
##           Reference
## Prediction    no   yes
##        no  19528    57
##        yes 15838  3401
##
## Accuracy : 0.5906
## 95% CI : (0.5857, 0.5955)
## No Information Rate : 0.9109
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1751
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.5522
## Specificity : 0.9835
## Pos Pred Value : 0.9971
## Neg Pred Value : 0.1768
## Prevalence : 0.9109
## Detection Rate : 0.5030
## Detection Prevalence : 0.5045
## Balanced Accuracy : 0.7678
##
## 'Positive' Class : no
##
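The statistics in the output above follow directly from the four cell counts, with "no" as the positive class. As a check, they can be recomputed by hand; the object names below are chosen for illustration only.

```r
# Counts copied from the confusion matrix above
# (rows = prediction, columns = reference; filled column by column).
cm <- matrix(c(19528, 15838, 57, 3401), nrow = 2,
             dimnames = list(Prediction = c("no", "yes"),
                             Reference  = c("no", "yes")))

n           <- sum(cm)
accuracy    <- sum(diag(cm)) / n
sensitivity <- cm["no", "no"]   / sum(cm[, "no"])   # true "no" predicted "no"
specificity <- cm["yes", "yes"] / sum(cm[, "yes"])  # true "yes" predicted "yes"
ppv         <- cm["no", "no"]   / sum(cm["no", ])
npv         <- cm["yes", "yes"] / sum(cm["yes", ])
balanced    <- (sensitivity + specificity) / 2
nir         <- max(colSums(cm)) / n                 # share of the majority class

print(round(c(Accuracy = accuracy, Sensitivity = sensitivity,
              Specificity = specificity, PPV = ppv, NPV = npv,
              Balanced = balanced, NIR = nir), 4))
```

The accuracy of about 0.59 is well below the no-information rate of about 0.91, meaning that always predicting "no" would score higher on accuracy. This is consistent with the reported `P-Value [Acc > NIR]` of 1, and it is why balanced accuracy is the more informative summary for this model.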
Running an LDA on the PCA variables
## [1] 23
## [1] 23
##
## Classification tree:
## tree(formula = y ~ ., data = tree.data)
## Variables actually used in tree construction:
## [1] "nr.employed" "pdays" "month"
## Number of terminal nodes: 4
## Residual mean deviance: 0.5694 = 23450 / 41180
## Misclassification error rate: 0.1005 = 4140 / 41188
##
## Classification tree:
## tree(formula = y ~ ., data = tree.data.train)
## Variables actually used in tree construction:
## [1] "euribor3m" "nr.employed"
## Number of terminal nodes: 4
## Residual mean deviance: 1.099 = 2594 / 2360
## Misclassification error rate: 0.2563 = 606 / 2364
## Warning in prune.tree(tree.bank, best = 5): best is bigger than tree size
##
## Classification tree:
## tree(formula = y ~ nr.employed + pdays + month + cons.price.idx +
## campaign + contact + education + age, data = tree.data, minsize = 5)
## Variables actually used in tree construction:
## [1] "nr.employed" "pdays" "month"
## Number of terminal nodes: 4
## Residual mean deviance: 0.5694 = 23450 / 41180
## Misclassification error rate: 0.1005 = 4140 / 41188
##    no   yes
## 30192  8632
##
## fit.pred    no   yes
##      no  28968  1224
##      yes  6398  2234
## [1] "Confusion matrix for LRF"
## Confusion Matrix and Statistics
##
##           Reference
## Prediction    no   yes
##        no  11210  2524
##        yes 24156   934
##
## Accuracy : 0.3128
## 95% CI : (0.3082, 0.3174)
## No Information Rate : 0.9109
## P-Value [Acc > NIR] : 1
##
## Kappa : -0.108
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.31697
## Specificity : 0.27010
## Pos Pred Value : 0.81622
## Neg Pred Value : 0.03723
## Prevalence : 0.91093
## Detection Rate : 0.28874
## Detection Prevalence : 0.35375
## Balanced Accuracy : 0.29353
##
## 'Positive' Class : no
##
## [1] "Overall accuracy for RF "
## [1] 0.3127962
## no yes
## 35366 3458
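The accuracy and Kappa reported for this last confusion matrix can be reproduced from its counts. The sketch below uses the standard Cohen's Kappa formula; the variable names are illustrative only.

```r
# Counts copied from the last confusion matrix above
# (rows = prediction, columns = reference; filled column by column).
cm <- matrix(c(11210, 24156, 2524, 934), nrow = 2,
             dimnames = list(Prediction = c("no", "yes"),
                             Reference  = c("no", "yes")))

n     <- sum(cm)
p_obs <- sum(diag(cm)) / n                       # observed agreement (accuracy)
p_exp <- sum(rowSums(cm) * colSums(cm)) / n^2    # agreement expected by chance
kappa_hat <- (p_obs - p_exp) / (1 - p_exp)

print(round(c(Accuracy = p_obs, Expected = p_exp, Kappa = kappa_hat), 4))
```

A negative Kappa means the predictions agree with the reference labels less often than chance would given the same marginal totals. A result like that usually merits a check of the label coding and of the cutoff direction before the model itself is interpreted.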
Social and economic context attributes:
16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)
17 - cons.price.idx: consumer price index - monthly indicator (numeric)
18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)
19 - euribor3m: euribor 3 month rate - daily indicator (numeric)
20 - nr.employed: number of employees - quarterly indicator (numeric)